Reconciling saliency and object center-bias 
hypotheses in explaining free-viewing fixations 

Ali Borji and James Tanner 


Abstract —Predicting where people look in natural scenes has 
attracted a lot of interest in computer vision and computational 
neuroscience over the past two decades. Two seemingly contrasting 
categories of cues have been proposed to influence where 
people look: low-level image saliency and high-level semantic 
information. Our first contribution is to take a detailed look at 
these cues to confirm the hypothesis proposed by Henderson 
and Nuthmann & Henderson that observers tend to look at the 
center of objects. We analyzed fixation data for scene free-viewing 
over 17 observers on 60 fully annotated images with various 
types of objects. Images contained different types of scenes, such 
as natural scenes, line drawings, and 3D rendered scenes. Our 
second contribution is to propose a simple combined model of 
low-level saliency and object center-bias that outperforms each 
individual component significantly over our data, as well as 
on the OSIE dataset by Xu et al. The results reconcile 
saliency with object center-bias hypotheses and highlight that 
both types of cues are important in guiding fixations. Our work 
opens new directions to understand strategies that humans use in 
observing scenes and objects, and demonstrates the construction 
of combined models of low-level saliency and high-level object- 
based information. 

Index Terms —visual attention, eye movements, saliency, 
bottom-up attention, free viewing, object saliency, space-based 
attention, object-based attention, center-bias, object center-bias 

I. Introduction 

Eye movements are proxies for overt visual attention. 
They help us understand how humans and animals allocate 
their perceptual and cognitive resources towards a limited 
portion of the observed visual data. They also inform us 
about characteristics of the filtered data. Understanding and 
modeling human attentional behavior has become increasingly 
important recently for two reasons: 1) the abundance of 
visual data in daily life demands highly efficient filtering 
methods with low computational complexity, specifically when 
dealing with natural scenes and videos, and 2) there are 
many applications in computer vision and robotics such as 
image/video compression, scene understanding, image thumbnailing, 
photo collages, human-robot interaction, and robot localization 
and navigation that could utilize resource allocation 
methods. See Q-GD for comprehensive reviews on visual 
attention. 

Where do people look during free viewing of images of 
natural scenes? A tremendous amount of research in the cognitive 
and computer vision communities has investigated this 
question for more than a decade, yet it still remains a hot 
topic. Two types of cues are believed to influence 
eye movements in this task: 1) low-level image features (a.k.a. 
bottom-up visual saliency) such as contrast, edge content, 
intensity bispectra, color, motion, symmetry, and surprise, and 
2) high-level features (i.e., object and semantic information) 
such as faces and people, text, object center 
priors, image center priors [19], [20], horizontal bias 
in scene viewing (only a left-ward bias for right handers, no 
effect for left handers), semantic object distances, 
scene global context, emotions [24], memory [25], [26], 
gaze direction [27], culture [29], and survival-related 
features such as food, sex, danger, pleasure, and pain [30]. 
Note that, while here we focus on a free-viewing task, 
some of these factors also play a role in top-down task-driven 
visual attention. 


A. Object center-bias 

As an alternative to the hypothesis of image-based 
saliency (low-level image features, such as contrast, color, and 
orientation [40]-[45]), the object-based hypothesis of attention 
considers objects as the unit of attention. The latter relates 
to the cognitive relevance theory and the role of cognitive 
top-down knowledge in attention. According to this theory, 
objects are manipulated to perform a task (e.g., in sandwich 
making). Overall, the idea of object-based attention 
is sensible, as to understand a scene one needs to localize 
objects, identify them, and establish their spatial relations. 
Eye movements tell us how a scene is understood by where 
they land. There has been some debate whether objects or 
saliency better predict fixations and the landscape still remains 
unclear [47], [48]. Note that object center-bias is different from 
image center-bias [19], which is the tendency of observers to 
preferentially look towards the center of images. 

The first fixation-based evidence for object center-bias was 
demonstrated by Henderson. He recorded eye movements 
of observers on line drawings of objects and found that 
viewers’ first fixations tended to be near the center of an 
object, and that there was a greater tendency to undershoot the 
center than to overshoot. Later, Trukenbrod and Engbert [49] 
reported a similar finding on a serial visual search task. A more 


^It is not easy to demarcate the category of some cues (e.g., object 
center-bias, text, face). Some authors have classified cues that influence eye 
movements into three categories: pixel, object, and semantic. Please see 0. 

^Volunteers were asked to make peanut butter and jelly sandwiches. The 
participants wore headgear that simultaneously tracked the movement of their 
eyes and recorded the scene before them. 
detailed investigation of the object center-bias for objects embedded 
in naturalistic scenes was conducted by Nuthmann & 
Henderson [2]. These authors measured the fixation landing 
positions within objects during free viewing of natural scenes, 
and showed that the preferred viewing location (PVL) for real 
objects in scenes was close to the center of the object (as 
shown in Figure 1). They also found that when compared to 
the PVL for real objects, there was less evidence for a PVL for 
human fixations within saliency proto-objects, identified 
by an extension to the Itti saliency map model. They argued in 
favor of object-based visual attention and proposed that during 
naturalistic scene viewing, the eye-movement control system 
directs eyes in terms of object units. Overall, these findings 
match previous findings that observers look at the center 
of words while reading. Another piece of evidence comes 
from the work of Elazary & Itti, who showed that objects 
are usually more salient than the background. 

Belardinelli & Butz [54] measured the distribution of fixation 
locations on objects over three tasks: 1) object classification 
(one of two objects), 2) mimicking lifting an object 
(lifting task), and 3) mimicking opening an object (opening 
task). They found that fixations were drawn to different task-relevant 
locations. Based on this, they suggested that attention 
first chooses objects of interest and then fixations are drawn 
to the most informative points. This result supports previous 
findings on the influence of task on attention. Eyes extract 
visual information in a goal-oriented anticipatory fashion even 
when single actions are to be performed on the same object. 

Inspired by the salient object detection models in computer 
vision (i.e., defining saliency at the level of objects as 
in [55]), Dziemianko et al. [56] applied models of salient 
object detection to fixation prediction, similar to Borji et 
al. [57]. They implemented and evaluated three models of 
salient object detection on fixations over two tasks: 1) visual 
counting: counting the number of occurrences of a cued target 
object and 2) object naming: naming objects present in the 
scene. In their analysis, they inserted a Gaussian blob at the 
center of a bounding box around an object. They showed that 
the object-based interpretation of saliency provided by these 
models is a substantially better predictor of fixation locations 
than traditional pixel-based saliency. This result is in alignment 
with findings by Borji et al. [57]. 

Xu et al. studied the effects of several types of attributes 
on gaze guidance during free-viewing at three levels: the pixel-level, 
the object-level, and the semantic-level. Pixel-level attributes 
included contrast, edge content, color, etc. Object-level 
attributes included size, convexity, solidity, complexity, and 
eccentricity. Semantic high-level attributes contained smell, 
sound, face, text, taste, touch, watchability, and operability. Using 
images with annotated objects and regression, they learned 
which factors were important in predicting fixations (e.g., faces 
and text were more important, but sound and motion less 
so). One of the factors they considered (categorized under 
object- or semantic-level attributes) was object center-bias. 
They fitted a two-dimensional normal distribution to the spatial 

^And also in another recent study j^. 

^This data is available at: http://homepages.inf.ed.ac.uk/keIIer/resources/. 


distribution of the fixations in the object-centered coordinate 
system and used it to weight the object center. Although 
they found that adding object- and semantic-level attributes 
increased fixation prediction performance, unfortunately they 
did not explicitly measure the ‘added value’ of object-center 
bias. 

Several works have used object information to build attention 
models at the object level. Some 
of these models propose how attention should be deployed to 
different objects at different times to fulfill a task. Some others, 
similar to our goal here, have explained fixations in the context 
of free-viewing. For example, Kavak et al. used a bank of 
object detectors to give higher weight to regions inside objects. 
Recently, Stoll et al. also proposed an approach to account 
for object-driven fixations and concluded that objects predict 
fixations better than saliency when combined with bottom-up 
saliency. 

Despite some previous evidence for the object-center hypothesis, 
three challenges still need to be resolved. 
First: the fact that observers tend to look near the center 
of objects could be because saliency might also be high in 
those regions. In other words, do observers look at the center 
of an object simply because saliency is higher there than 
at the object boundary? Nuthmann & Henderson did not directly 
control for this confounding factor. Instead, they measured 
the distribution of saliency at salient patches/proto-objects and 
showed that compared to the distinct PVL for real objects, 
there was less evidence for a PVL for human fixations within 
saliency proto-objects. But this analysis does not seem to 
address this confound. Instead, here we measure the magnitude 
of low-level saliency inside the object. In a complementary 
analysis, we combine both saliency and object center-bias to 
see whether or not there is added value. 

Second: how can we define the center of an object? This is 
a challenging task due to the variety of object parameters such 
as shape, size, concavity/convexity, symmetry, etc. Almost 
all previous studies have used bounding boxes, which might 
not be a good option in many cases (e.g., the center of the 
bounding box may fall outside of the object area for a concave 
object). Further, using bounding boxes causes confusion and 
inaccuracy in assigning fixations to the foreground object 
or background. For example, in the analysis of Nuthmann & 
Henderson in Figure 1b, several points from the background are 
also included. To address this challenge, we first use object 
boundary polygons instead of bounding boxes. Second, we 
apply object center-bias on each individual object from its 
center of mass towards the outside. 
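
The difference between the two notions of "center" can be sketched as follows. This is a minimal illustration, assuming the object is available as a binary pixel mask (the polygon-to-mask step is omitted); the function names and the L-shaped toy object are hypothetical. The center of mass is taken as the mean x and mean y coordinate of the object's pixels.

```python
import numpy as np

def center_of_mass(mask):
    """Mean (x, y) coordinate of the pixels that make up the object."""
    ys, xs = np.nonzero(mask)
    return float(xs.mean()), float(ys.mean())

def bbox_center(mask):
    """Center (x, y) of the tightest axis-aligned bounding box."""
    ys, xs = np.nonzero(mask)
    return (xs.min() + xs.max()) / 2.0, (ys.min() + ys.max()) / 2.0

# Hypothetical concave (L-shaped) object on a 10 x 10 grid.
mask = np.zeros((10, 10), dtype=bool)
mask[0:10, 0:2] = True   # vertical bar: rows 0-9, columns 0-1
mask[8:10, 0:10] = True  # horizontal bar: rows 8-9, columns 0-9

com = center_of_mass(mask)
box = bbox_center(mask)
# The bounding-box center lands on background for this shape.
box_on_object = bool(mask[int(box[1]), int(box[0])])
```

For this shape the bounding-box center sits in the empty corner of the L, while the center of mass is pulled toward the bulk of the object's pixels. For strongly concave shapes the center of mass can itself fall on background, so the sketch only illustrates that the two definitions differ.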

Third: this challenge is in regards to the complexity of 
stimulus set, since natural scenes are inherently complex. 
For example, observers may have different viewing behavior 
depending on the complexity of the scene. They may visit 

^They did not specifically mention how they defined the center of an object 
or whether they used bounding boxes. It seems, however, that similar to 
Nuthmann & Henderson and Dziemianko et al., they used bounding 
boxes. 

^The center of mass (CoM) is calculated using the standard methods. The 
x and y coordinates of the CoM are, respectively, the average of the x 
coordinates of the pixels and the average of the y coordinates of the pixels 
that make up the object. 
Fig. 1: Object-based center-bias, a) An image with a sample annotated object (a basket). Note how loose the bounding box is 
in this case, b) A close up of the object bounding box and fixations (shown in red). Note that some fixations fall outside the 
object and on the background. The center of the object is the origin of the coordinate system for fixations, c) Distribution of the 
horizontal component of landing positions for objects (red circles) and the corresponding distribution of the vertical component 
of within-object landing positions (blue squares). Circles are data and curves are fitted using truncated Gaussians. The vertical 
broken line indicates the center of the object. Horizontal and vertical lines are overlaid, (d) Corresponding smoothed two- 
dimensional viewing location histogram. The intersection of the two broken lines marks the center of the object. Images are 
taken with permission from Nuthmann & Henderson. 


the center of the object for an image with few (large) objects 
but may not do so for objects amidst scene clutter. In order 
to answer this question, one needs large amounts of data. To 
address the challenge of complexity, we run our experiment 
over a large amount of data from two datasets with a variety 
of images and objects. 

B. Contributions 

In summary, we offer the following contributions in this 
work: 

1) We verify the hypothesis that “observers tend to look 
near the center of objects in scene free-viewing” and 
establish that this effect is independent of low-level 
bottom-up saliency. 

2) We construct a combined model of object center-bias 
and saliency. To do so, we answer the following ques¬ 
tions: a) How can we construct an object center-bias 
map to emphasize object centers? b) What is the best 
way to combine this map with image saliency (addition 
or multiplication)? 


II. Data 

A. Our Data 

1) Stimuli: Stimuli consisted of 60 color images (30 synthetic, 
30 natural). Figure 2 shows some examples of our 
stimuli. Images were resized to 1920 x 1080 pixels by 
adding gray margins while preserving the aspect ratio. We 
intentionally did not include stimuli with persons, animals, or 
faces, mainly because these objects have interesting parts on 
their ends. We chose images from different categories (line 
drawings, 3-D rendered cartoonic images, etc.) with different 
types of objects. Object boundaries were manually traced. Our 
methodology for selecting objects was to only label objects 
that were completely unoccluded in the image. This was done 
so that the analysis of a center bias effect would not be 
influenced by objects whose computed center of mass was 
different from the theoretical center of mass. We attempted to 
choose images with less photographer bias and with multiple 
objects off the image center, thus reducing the effect of center- 

^Tendency of photographers to frame interesting objects at the center of 
the image. 
bias on fixations. 

2) Observers: Seventeen observers (4 male, 13 female) 
participated in this experiment (mean age = 20.58, std = 
1.37). Observers were students at the University of Southern 
California (USC) from the following majors: Neuroscience, 
Psychology, Biology, Business, Biomedical Engineering, and 
Accounting. The experimental methods were approved by 
USC’s Institutional Review Board (IRB). Observers had 
normal or corrected-to-normal vision and were compensated 
by course credits. Observers were asked to freely watch the 
images. 

3) Apparatus and procedure: Observers sat 130 cm away 
from a 42 inch monitor screen such that scenes subtended 
approximately 43° x 25° of visual angle. A chin/head rest was 
used to minimize head movements. Stimuli were presented at 
60 Hz at a resolution of 1920 x 1080 pixels in random order. 
Eye movements were recorded via an SR Research Eyelink 
eye tracker (spatial resolution of 0.5°) sampling at 1000 Hz. 
Each image was shown for 30 seconds followed by a 5-second 
delay (gray screen). The eye tracker was calibrated using a 5- 
point calibration method at the beginning of each recording 
session. 

B. OSIE dataset 

The OSIE (“Object and Semantic Images and Eye-tracking”) 
dataset was created by Xu et al. to explore how 
object and semantic saliency can be used for predicting where 
observers look in free viewing of natural scenes. It contains 
eye tracking data of 15 participants over a set of 700 images 
(for 3 seconds viewing time). Each image has been manually 
segmented into a collection of objects by one person. Semantic 
attributes of objects have also been manually labeled (e.g., 
operability, watchability, text). This dataset introduced two 
novel contributions: first, it contains a large number of object 
categories and several objects have semantic meanings, and 
second, the majority of the images contain multiple dominant 
objects. Figure 4 shows example images from the OSIE dataset 
along with fixations and object annotations. Please refer to Xu 
et al. for more details on this dataset. 

The OSIE dataset is suitable for our purposes because it has 
a variety of images from different categories. Further, object 
boundaries have been carefully annotated on this dataset for a 
large number of objects. 

Figure 3 illustrates statistics of the OSIE dataset. The 
majority (87.01%) of objects occupy 10% or less of 
the image area. 52.68% of objects contain 10% or less 
of the fixations on the image. We observe that the normalized 
size of the most salient object (object at the peak of the fixation 
map; 1012 out of overall 5551 object annotations) is usually 
larger than that of regular objects, as shown in Figure 3, second row. 
74.90% of most salient objects occupy 10% or less 
of the image area. Similarly, only about 5% of most salient 
objects contain 10% or less of the fixations on the 
image. About 14% of the most salient objects contain 50% 
or more of the fixations in the image. This can also 
be observed from the third row of Figure 3, which shows the 

^http://www.ece.nus.edu.sg/stfpage/eleqiz/predicting.html 
Fig. 3: Statistics of the OSIE dataset, a) histogram of normalized 
object size, b) histogram of fraction of fixations (number 
of fixations on an object over number of all image fixations), 
c) histogram of normalized salient object size. Salient object 
is the one with the maximum fraction of fixations on it, d) 
similar to b but for salient objects, e & f) plot of fraction of 
fixations as a function of normalized object size. ‘Frequency’ 
on the y-axis indicates the number of occurrences. 


relationship of normalized object size versus the fraction of 
fixations over all object annotations. Insets in Figure 3 show 
the average annotation map and average fixation map. As in 
other eye movement datasets, a large degree of fixation center- 
bias is observed on this dataset. 

On average, 5.18 and 7.93 objects are annotated over our 
dataset and OSIE, respectively (median: 5 vs. 7). The total 
number of fixations on our dataset is 76,869 (over 60 images). 
The corresponding number for the OSIE dataset is 98,321 (over 700 images). 
Figure 5 shows a histogram of annotated objects and the 
average annotation map over the two datasets. 

III. Measuring object center-bias 

In this section, we verify the object center-bias hypothesis 
by measuring the distribution of fixations inside objects. To 
do so, we need a way to define the center of an object. We 
choose the center of mass of an object as the object center. 
Then, we grow circles from the object center such that each 
circle (tube) contains an additional 10% of the object area. In 
other words, the difference of object coverage between each 
successive pair of concentric circles is 10% of the whole object 
area. We repeat this operation until all object area is covered. 
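
The ring construction described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the object is given as a binary mask, and instead of literally growing circles, pixels are ranked by distance from the center of mass and split into ten equal-sized groups, which gives each ring 10% of the object area up to ties at ring borders. The function name `ring_labels` is hypothetical.

```python
import numpy as np

def ring_labels(mask, n_rings=10):
    """Label object pixels 1..n_rings by distance rank from the center of mass.

    Ring 1 is the inner-most 1/n_rings of the object area, ring n_rings the
    outer-most. Background pixels are labeled 0.
    """
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()            # center of mass
    dist = np.hypot(xs - cx, ys - cy)        # distance of each object pixel
    order = np.argsort(dist, kind="stable")  # nearest pixels first
    ranks = np.empty(len(dist), dtype=int)
    ranks[order] = np.arange(len(dist))
    ring = ranks * n_rings // len(dist) + 1  # equal-area bins
    labels = np.zeros(mask.shape, dtype=int)
    labels[ys, xs] = ring
    return labels

# Toy object: a filled 20 x 20 square inside a 30 x 30 image.
mask = np.zeros((30, 30), dtype=bool)
mask[5:25, 5:25] = True
labels = ring_labels(mask)
pixels_per_ring = np.bincount(labels[mask], minlength=11)[1:]
```

Counting fixations per label then gives the per-ring fixation distribution; because every ring covers the same area, counts and densities are proportional.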
Fig. 2: Sample images from our dataset along with annotated objects and fixations of all observers. Notice how certain locations 
inside some objects attract more fixations than others. 







Fig. 4: Sample images from the OSIE dataset along with object annotations and fixations. Due to shorter presentation times 
(3 seconds vs. 30 seconds in ours), there are fewer fixations in OSIE images than in ours. 



Figure 6b inset shows an example of this operation. We call 
this map the “object center-bias map” and denote it by “O”. 

For each of the circular regions (tubes), we then count the 
number of fixations that fall on that region. Figure 6a shows 
the distribution (converted to probability density function) of 
fixations over the 10 circles averaged over all objects on each 
dataset. As it shows, as one moves away from the object 
center toward the object boundary, the probability of fixations 
declines (almost linearly). 

Figure 6b shows the distribution of saliency (average 
saliency inside each tube) using the AWS saliency model [65] 
from center to boundary of the objects. Here, again we observe 
a decline in saliency as we move from the object center toward the 
object boundary. Similar to fixations, this decline is sharper 
on our dataset than on the OSIE. This result indicates that 
on average, saliency is higher at the object center which, 
as discussed in the introduction, may explain some of the 
additional fixations in that region. To answer whether saliency 
can explain all fixations or not (i.e., discounting the effect of 
saliency confound), in the next section we follow a modeling 
approach by adding these two components. The rationale is 
as follows: if we observe a boost in fixation prediction 
by adding object center-bias to saliency, we can then conclude 
that object center-bias has an (independent) added value to 
what early saliency already offers. 

To explore the generality of the hypothesis over all objects 
and the factors that it may depend on, we define an object 
center-biased index which is the sum of fixation densities 
inside the first 5 inner-most circles/rings over the sum of 
fixation densities inside all ten circles/rings (i.e., over the entire 
object): 

obj\_cnt\_idx = \frac{\sum_{i=1}^{5} p_i}{\sum_{i=1}^{10} p_i} \qquad (1) 

where p_i is the density of fixations inside the i-th tube. The 
higher the obj_cnt_idx, the stronger the tendency of fixations towards 
the object center. Figure 7 shows the histogram of 
obj_cnt_idx indices on our dataset. For the majority of objects 
(200 out of 311) this index is higher than 0.5, which would 
be the value if fixations were distributed uniformly over the 
Fig. 5: (a) Histograms of annotated objects per image over the two datasets. Images in the OSIE dataset contain more object 
annotations on average compared to our dataset, (b) Average object annotation map over two datasets. 


entire object. As expected, objects with high obj_cnt_idx often 
have content at the image center (Figure 7b, e.g., book, 
grandfather clock) while objects with low obj_cnt_idx usually 
have imbalanced/tilted features on one side (Figure 7c, e.g., 
sword, microphone). We notice that the affordance and shape of 
the object also influence where people look inside it. For 
example, in the microphone case, there are more features 
around its tip including salient edges which differ from their 
neighbors (hence high saliency there), which attract more 
fixations (a similar argument holds for the sword). Replacing the 
circles with bounding boxes (i.e., rectangular tubes) shows the 
same pattern of results. 
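
The index can be computed directly from per-ring fixation counts. The sketch below assumes ten equal-area rings, so fixation counts are proportional to the densities p_i; the function name and the example counts are hypothetical.

```python
import numpy as np

def obj_cnt_idx(ring_counts):
    """Share of fixation density in the 5 inner-most of 10 equal-area rings."""
    p = np.asarray(ring_counts, dtype=float)
    if p.shape != (10,) or p.sum() <= 0:
        raise ValueError("expected 10 ring counts with at least one fixation")
    return float(p[:5].sum() / p.sum())

uniform = obj_cnt_idx([7] * 10)                              # no center-bias
centered = obj_cnt_idx([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])      # linear decline
```

Uniformly spread fixations give exactly 0.5, and a linear decline from center to boundary gives 40/55, roughly 0.73.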

IV. Our augmented saliency model 

Having seen that the object center-bias effect exists on a majority 
of objects, in this section we propose a simple combined 
model of saliency and object center-bias. This model, in 
addition to having better fixation prediction accuracy, also 
helps further investigate the accuracy of the object center-bias 
hypothesis. We follow the previous line of research that 
linearly combines cues for computing saliency. 
Our model is simply a weighted combination of the 
saliency map and the object center-bias map as follows: 

SM = (1 - \beta) \times S + \beta \times O, \quad \beta = 0 : 0.1 : 1 \qquad (2) 

where S is the saliency map, O is the object center-bias map, 
and β is a parameter that controls the relative magnitude of 
the two maps. β = 0 is just the pure bottom-up saliency map 
(AWS model), and β = 1 is the pure object center-bias map. 
Through experiments, we learned that adding the term S x O 
did not improve our results, so we discard it here. The S, O, 
and resulting SM maps are all normalized (sum to 1). 
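
A minimal sketch of the combination in Eq. (2), assuming both maps are non-negative arrays of the same shape. The function names and the random toy maps are hypothetical and stand in for an actual AWS saliency map and an object center-bias map.

```python
import numpy as np

def normalize(m):
    """Scale a non-negative map so that it sums to 1."""
    m = np.asarray(m, dtype=float)
    return m / m.sum()

def combine(S, O, beta):
    """SM = (1 - beta) * S + beta * O, with S, O and SM each summing to 1."""
    return normalize((1.0 - beta) * normalize(S) + beta * normalize(O))

betas = np.linspace(0.0, 1.0, 11)        # 0, 0.1, ..., 1

# Toy stand-ins for a saliency map and an object center-bias map.
rng = np.random.default_rng(0)
S_demo = rng.random((4, 6))
O_demo = rng.random((4, 6))
SM_demo = combine(S_demo, O_demo, beta=0.3)
```

Because S and O are normalized before mixing, β is a direct weight on probability mass: the convex combination already sums to 1, so the final normalization only guards against rounding.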


Figure 8a shows the NSS scores of the combined model as 
a function of parameter β. As β increases, the NSS peaks and 
then declines over both datasets. Looking at the optimal β for 
each dataset, we find that they are close to each other, 0.15 for 
our data and 0.35 for OSIE, which result in NSS scores of 1.45 
and 1.705, respectively. This means that if we were to train 
the model over our data and test it on the OSIE dataset (or 
vice-versa), we would have achieved a better performance than 
both saliency and object center-bias maps on the destination 
test dataset. In other words, if we were to apply the best β 
from one dataset to another, results would still be better than 
both saliency and object center-bias models. This means that 
our model generalizes well over datasets. 
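
The β sweep can be sketched as follows. NSS is computed as the mean of the z-scored map (zero mean, unit standard deviation) at fixated pixels. Everything else here is an illustrative assumption: the toy maps, the fixation coordinates, and the function names are hypothetical, and a real evaluation would average NSS over all images of a dataset rather than use a single image.

```python
import numpy as np

def nss(sal_map, fix_rows, fix_cols):
    """Normalized Scanpath Saliency: mean z-scored map value at fixations."""
    m = np.asarray(sal_map, dtype=float)
    z = (m - m.mean()) / m.std()          # assumes the map is not constant
    return float(z[fix_rows, fix_cols].mean())

def combine(S, O, beta):
    """Weighted sum of the two maps, each normalized to sum to 1."""
    return (1.0 - beta) * S / S.sum() + beta * O / O.sum()

def sweep_beta(S, O, fix_rows, fix_cols, betas):
    """NSS of the combined map for every candidate beta."""
    return np.array([nss(combine(S, O, b), fix_rows, fix_cols) for b in betas])

# Toy image: noisy "saliency" plus a blob-shaped "object center-bias" map.
rng = np.random.default_rng(0)
S = rng.random((40, 60))
yy, xx = np.mgrid[0:40, 0:60]
O = np.exp(-((yy - 20) ** 2 + (xx - 30) ** 2) / (2 * 5.0 ** 2))
fix_rows = np.array([20, 19, 22, 5])      # hypothetical fixations (row, col)
fix_cols = np.array([30, 31, 28, 50])

betas = np.linspace(0.0, 1.0, 11)
scores = sweep_beta(S, O, fix_rows, fix_cols, betas)
best_beta = float(betas[np.argmax(scores)])
```

Applying the `best_beta` found on one dataset to the other is the cross-dataset check discussed above.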

Figure 8 also shows higher performance over the OSIE dataset 
compared to our dataset, which can be attributed to two causes: 
1) more objects are annotated in OSIE images than in our images, 
which results in a higher contribution of objects (mean 5.18 on 
our data vs. 7.93 over OSIE), and 2) viewing time is longer 
on our data, which might have caused subjects to be driven 
more by the image background. We believe that the second 
cause is a more plausible explanation of this effect as we did 
not see a trend in performance as a function of the number 
of annotated objects in a scene. Further, while the number 
of images in the OSIE dataset is about 12 times higher than 
in our data, the number of fixations is nearly the same. Longer 
viewing time leads to fixations that fall on the background 
clutter and this results in lower prediction accuracy since these 
fixations are not accounted for by the object annotations. 

Figure 8b shows the results over both datasets for saliency 

^Normalized Scanpath Saliency, which is the average of the response 
values at human eye positions in a model’s saliency map that has been 
normalized to have zero mean and unit standard deviation. NSS = 1 indicates 
that the subjects’ eye positions fall in a region whose predicted saliency is one 
standard deviation above average. NSS ≤ 0 indicates that the model performs 
no better than picking a random position, and hence is at chance in predicting 
human gaze. 
Fig. 6: (a) Distribution of fixations over the object area from the inner-most ring (1 in x-axis) to the outer-most ring (10 
in x-axis). Note that the difference in rings adds 10% to the object area and not the entire circle (i.e., it is incremental), (b) 
Distribution of saliency using the AWS saliency model, over both datasets. Inset shows an example object and the corresponding 
object map (denoted OBJ). 



Fig. 7: (a) Histogram of object center-bias indices over our data. An index above 0.5 means more center-bias, (b) Some objects 
with high indices, (c) Some objects with low indices. These objects usually have a salient part on one of their ends. 


alone, object map alone, and their optimal combination. Average 
NSS for AWS, OBJ (i.e., object center-bias map O), and 
the combined model (with optimal β) over our data in order 
are: 1.3302, 1.0828, and 1.4501. The combined model significantly 
outperforms the other two models (t-test, combined vs. AWS, 
p = 1.9301e-06; combined vs. OBJ, t-test, p = 2.9015e-16). 
The AWS model here significantly outperforms the OBJ model (t- 
test; p = 7.0320e-06). 

The average NSS for AWS, OBJ, and combined model 
over OSIE dataset in order are: 1.4530, 1.4554, and 1.7051. 
The combined model significantly outperforms the other two 
models (t-test, combined vs. AWS, p =3.1412e-69; combined 
vs. OBJ, t-test, p = 1.9295e-73). The difference between AWS 
and OBJ models is not statistically significant here (t-test; p 
= 0.9136). The difference between the combined model and 


the saliency model is smaller in our dataset compared to the 
OSIE dataset (9% vs. 17.27%). This could be due to the larger 
number of annotated objects in the OSIE images than in the 
images in our dataset. Interestingly, on OSIE, all tested values 
of β other than 0 and 1 are above both AWS and the object 
center-bias models. Our object center-bias model is essentially 
similar to the model proposed by Einhauser et al. [47] with the 
difference that here we emphasize the object center instead of 
uniformly distributing activity over the entire object. Further, 
there is no object weighting based on memory recall (i.e., the 
same weight for all objects). 

Figures 9 and 10 show scatter plots of saliency vs. combined 
model over our data and OSIE, respectively. Each dot in these 
plots represents the NSS score for one image. Over our dataset, 
for 91.67% of images, the combined model outperforms the 
AWS saliency model. This figure for the OSIE is 80.71%. 
These values for the combined model vs. object center-bias 
map over our data and OSIE, in order are 83.33% and 77.71%. 
On both datasets for less than 50% of the images, the object 
map wins over the saliency map (20% on our data and 
48.71% over OSIE). For images where the combined model 
outperforms the saliency model significantly, there are usually 
few objects in the scene (e.g., Figure 9b, images 1, 2, and 
3) and scenes do not usually have much background clutter. 
For images where the combined model performs worse than 
saliency, usually interesting parts of the object do not occur 
at the object center (e.g., in people, where the entire body is 
annotated as one object, the face is the most interesting part but 
it is not at the center; Figure 10b, 4th image). 

Figure 11 shows the NSS score for three different types of 
object center-bias including linear weighting (our implementation 
so far), constant weighting (uniform distribution of weight 
over the entire object), and Gaussian weighting (which weights 
the 10 circles/rings using a normalized Gaussian function) over 
(a) our data and (b) OSIE dataset. Results do not show a big 
difference in performance or in optimal β. We find that linear 
weighting of object center-bias is the best strategy consistently 
over both datasets. 
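
The three weighting schemes can be sketched as per-ring weights, from the inner-most ring (index 0) to the outer-most (index 9). The exact slopes and the Gaussian width are not given in the text, so the linear profile and `sigma` below are assumptions chosen only for illustration, and `ring_weights` is a hypothetical name.

```python
import numpy as np

def ring_weights(kind, n_rings=10, sigma=3.0):
    """Per-ring weights (inner-most first), normalized to sum to 1."""
    r = np.arange(n_rings, dtype=float)
    if kind == "constant":
        w = np.ones(n_rings)                  # uniform over the object
    elif kind == "linear":
        w = n_rings - r                       # assumed slope: 10, 9, ..., 1
    elif kind == "gaussian":
        w = np.exp(-0.5 * (r / sigma) ** 2)   # assumed width: sigma rings
    else:
        raise ValueError(f"unknown weighting: {kind}")
    return w / w.sum()

w_const = ring_weights("constant")
w_lin = ring_weights("linear")
w_gauss = ring_weights("gaussian")
```

Painting each object pixel with the weight of its ring yields the object center-bias map O for the chosen scheme.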

We also noticed that replacing polygons with bounding 
boxes over OSIE dataset results in NSS of 
1.112, which is above the NSS of 1.083 using polygons, but overall 
does not significantly improve the combination performance. 
The higher performance with bounding boxes arises because they 
better account for fixations around the edges of objects. 

V. Discussion 

In this work, we verified the validity of the object center- 
bias hypothesis in the context of free-viewing. We believe 
there might be an even stronger effect of object center-bias in 
the presence of a task. According to the cognitive relevance 
theory (see [46]), objects are more important when there is 
a task (compared to free-viewing). Some interesting tasks in 
this regard include: 1) asking subjects to count the number of 
objects in a scene, and 2) asking subjects to manipulate objects 
(e.g., in a coffee-making task). In the latter, subjects may 
also look at those features that are related to the task (e.g., 
the handle of the kettle) as suggested in Belardinelli & Butz [54]. 
It has also been shown that in object categorization, human 
subjects fixate on informative parts of objects (see Hartendorp 
et al. [67]). Some other interesting tasks here include aesthetic 
judgment, interestingness judgment, visual search, and scene 
memorization. 

Here, we discuss some important parameters for further 
investigation of the object-center hypothesis that should be 
taken into account in future studies. The first parameter is 
scene clutter. The manner in which humans attend to objects 
might differ depending upon whether they are viewing 
a simple scene with few objects or a complex scene with 
several objects and/or an amorphous background. In a complex 
scene, viewers may quickly scan the image in order to collect 
more information, which may cause them to be driven to 
spatial outliers. The second parameter, related to the first one, 
is scale. If objects are shown to observers at a large scale 
(and hence with larger object sizes), then they may not tend to 
look at the empty central regions inside the object, especially 
if these regions don’t contain features (imagine a close-up view 
of a white board). The third parameter concerns object symmetry. 
It has been shown in Kootstra et al. [68] that people tend 
to look at the center of symmetrical objects. The question 
that arises here is “Are object center-bias and symmetry two 
different cues?”. In other words, “Do people look at the 
center of asymmetrical objects?”. The fourth parameter regards 
viewing constellated objects made of several components. 
Object concavity/convexity is the fifth parameter. For example, 
what happens if the center of the object lies outside the 
object? 

To investigate the above-mentioned parameters, we recommend 
two approaches. First, more systematic studies over simple 
synthetic scenes are desirable. For example, imagine a plain 
object with no features inside. As soon as a salient point/region 
is inserted somewhere inside the object (but off-center), most 
likely viewers will not look at the center anymore (or will look 
less). This is in alignment with our analysis in this paper, which 
tested whether saliency peaks at the center of objects 
in the real world or not. Another similar analysis would be 
to collect objects with no salient points inside and test whether 
viewers still look around the object center (similar to some of 
our images). Overall, the main difficulty in investigating the 
object center-bias arises from the fact that there is a large 
variety of objects in natural scenes. Indeed, the object-center 
effect is stronger for certain types of objects. Second, we believe 
that large-scale object-annotated datasets (e.g., datasets by 
Greene [69], Cheng et al. [70], and Li et al. [71]) can be 
very useful to understand how saliency and object information 
are related in scene viewing and understanding. 

In contrast to Nuthmann & Henderson’s conclusion, 
which stated that “... attentional selection in scenes is object- 
based. Saliency only has an indirect effect on attention, acting 
through its correlation with objects ...”, our results suggest that 
both low-level saliency and object information (here object 
center-bias) contribute (although correlated) to attention during 
scene free-viewing. This finding aligns with our previous 
results in Borji et al. [48], where we criticized the hypothesis by 
Einhauser et al. [47] that “Objects predict fixations better than 
early saliency” and showed that saliency is a better predictor 
of fixations in free-viewing. Einhauser et al. built a map 
with object regions weighted by their recall frequency in a 
scene viewing (for memory testing) task. Although the debate 
over whether saliency or objects are better predictors of fixations 
is still ongoing, the bottom line is that both factors contribute 
independently to guiding fixations. 

Is object center-bias a bottom-up or top-down cue? It 
is true that the object center can be computed by simple, 
computationally efficient early processing (using proto- 
objects [51]), but the mechanism that chooses to drive saccades 
to the center of objects (even in the presence of more salient edge 
regions) seems to be a top-down process. By analogy to the 
face cue that attracts attention and gaze, there might be some 
dedicated neural circuitry for driving saccades to the object 
center. This is in alignment with the object-based theory of 
attention, which states that objects are the unit of attention. 
The actual implementation of this mechanism needs to be further 
investigated by neurophysiology and psychophysics studies. 

Footnote (Greene dataset): http://stanford.edu/~mrgreene/labelme.html 
Footnote (Cheng et al. dataset): http://mmcheng.net/gsal/ 
Footnote (Li et al. dataset): http://cbi.gatech.edu/salobj/ 
Footnote (comparison with Einhauser et al.): This holds at least with 
the way that Einhauser et al. used objects to build a model. 
If they had added object center-bias to their model, most likely they 
would have achieved much better results compared with saliency alone 
(i.e., the OBJ model in our work). 

Fig. 8: (a) NSS score of the combined model as a function of β. β = 0 corresponds to pure saliency and β = 1 corresponds to 
the pure object map. Note that over the whole range of β values, the combined model performs better than both the saliency 
and object models over the OSIE dataset. Over our data, since only some of the objects are annotated, a larger magnitude of 
the saliency model is necessary to make a superior combined model. (b) Average NSS score of the models over our data and 
the OSIE dataset. Error bars indicate standard error of the mean (s.e.m.). 

Fig. 9: (a) Image-wise comparison between the NSS score of saliency vs. a combined model for all images in our data. Each 
dot is for one image. Performance of the combined model is with the optimal β. Even with a small number of annotated objects 
per image, we observe an increase in performance of the combined model. The percentage of images for which the combined 
model performs better than each individual component is also shown. For 55 images, the combined model outperforms the 
AWS model (50 with respect to the OBJ model). (b) Two images with their corresponding prediction maps. For the first image, 
the saliency map already explains many of the fixations (i.e., high NSS), so inserting object center-bias, although helpful, does 
not add much to the score. For the second image, the object map brings a lot of value. 

Are eye movements driven by objects or by early saliency? 
And by extension, is attention object-based (e.g., [64], 
[72]-[76]) or saliency-driven (e.g., [40]-[44])? Based on our 
results here (as well as previous studies, e.g., [64]), 
we believe that both forms of attention guidance do 
occur. However, this needs to be studied further, for example 
by carefully controlling the scene complexity and background 
clutter. One approach would be to use objects with no texture 
inside them (e.g., shapes) and test whether observers look at 
object centers. One piece of evidence that eye movements 
are driven by early saliency comes from the fact that eye 
movements are driven to salient regions in scenes where there 
are no well-defined objects (e.g., fractal scenes [66]). Evidence 
in favor of object-based attention comes from the finding that 
fixations are driven to the center of objects. The 
interplay between these two forms of attention in daily life 
remains to be investigated further. 

Are saliency and object center-bias independent cues? In 
other words, do they both contribute to guiding gaze? Here, 
we showed that a simple linear combined map of both cues 
outperforms each individual map. This indirectly shows that 
there is an added value in their combination, which means 
that these maps are not subsets of each other. In a more 
direct analysis, in a parallel study to ours, Stoll et al. [64] 
have addressed this question. They modified their stimuli by 
fading edges of objects (effectively reducing saliency) and then 
measured the performance of early saliency models versus an 
object center-biased model. They showed that the performance 
of early saliency models degraded drastically over modified 
stimuli while the performance of object center-bias remained the 
same. From this, they concluded that saliency and object center- 
bias are two different cues. 

Fig. 10: Similar to Figure 9 but over the OSIE dataset. (a) NSS score of saliency vs. combined model. (b) Sample images with 
their corresponding prediction maps. These images were chosen to show cases where map combination increases performance 
(compared to AWS) drastically (images 1 & 2), moderately (3), and a case where combination slightly hinders performance 
(4). On image 4, each person was annotated as one object and emphasis was placed at the center of their body while fixations 
were drawn to their heads. Better performance would have been achieved if human heads were annotated on this image. 
Panel (a) annotations: AWS + Obj > Obj [544 out of 700 = 77.71%]; AWS + Obj > AWS [565 out of 700 = 80.71%]; 
Obj > AWS [341 out of 700 = 48.71%]. 
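
A minimal sketch of the linear combination and its evaluation may help make the argument concrete. It assumes the combined map has the form (1 - β) · saliency + β · object map, which matches the endpoints described for Figure 8 (β = 0 is pure saliency, β = 1 is the pure object map) but is otherwise an assumption, as is the premise that both maps are already scaled to a comparable range. NSS is computed in its standard form, as the mean z-scored map value at fixated locations. The function names, the toy 3x3 maps, and the fixations are made up for illustration.

```python
import statistics


def nss(pred_map, fixations):
    """Normalized Scanpath Saliency: mean z-scored map value at fixated pixels."""
    values = [v for row in pred_map for v in row]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation of the map
    if std == 0:
        raise ValueError("NSS is undefined for a constant map")
    return statistics.fmean([(pred_map[r][c] - mean) / std for r, c in fixations])


def combine(saliency, objects, beta):
    """Pixel-wise mix: beta = 0 gives pure saliency, beta = 1 the pure object map."""
    return [[(1 - beta) * s + beta * o for s, o in zip(s_row, o_row)]
            for s_row, o_row in zip(saliency, objects)]


# Toy 3x3 maps and (row, col) fixations, made up for illustration only.
saliency = [[0.9, 0.1, 0.0], [0.2, 0.3, 0.1], [0.0, 0.1, 0.2]]
objects = [[0.0, 0.2, 0.0], [0.2, 1.0, 0.2], [0.0, 0.2, 0.0]]
fixations = [(0, 0), (1, 1)]

# Sweep beta over [0, 1] and keep the value with the highest NSS.
betas = [i / 10 for i in range(11)]
scores = {b: nss(combine(saliency, objects, b), fixations) for b in betas}
best_beta = max(scores, key=scores.get)
print(best_beta, scores[best_beta])
```

Because the sweep includes β = 0 and β = 1, the best combined score can never fall below either individual map on the data used for the sweep; the informative question is how far it rises above them and whether the gain holds across images.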

Some of the saliency models that have done well in previous 
benchmarks might have implicitly emphasized the 
object center more (e.g., [65], [77]). For example, the AWS 
model generates some notion of objecthood using proto- 
objects and whitening. Thus, without explicitly accounting for the 
object center-bias hypothesis, these models have been able 
to predict fixations better. Explicit integration of this effect 
into saliency models (similar to our work here) or using more 
recent models (e.g., Boosting or Conditional Random Fields 
(CRF)) could be an interesting direction for future modeling. 

In addition to the datasets used here, some other annotated 
datasets exist which can be used to further investigate the 
relationships between bottom-up saliency and object center- 
bias and also to study the above-mentioned factors. Three 
examples include: 1) the dataset by Greene [69], which is 
mainly designed for scene categorization and understanding 
research (http://stanford.edu/~mrgreene/labelme.html). A total 
of 48,167 objects have been hand-labeled in 3,499 scenes 
from 16 categories using the LabelMe tool; 2) the UCSB 
dataset created by Koehler et al. [78]. This dataset contains 
800 images. One hundred observers performed four tasks 
(22 performed explicit saliency judgment, 20 performed free 
viewing, 20 performed saliency search, and 38 performed a 
cued object search task); and 3) a dataset recently introduced 
by Li et al. [79], known as PASCAL-S. These authors first 
segment all objects and then assign saliency orders to objects. 
This dataset contains eye movements of 8 observers over 850 
images from the PASCAL VOC dataset. 


VI. Conclusion 

In this study, we first evaluated the object center-bias 
hypothesis by Henderson and Nuthmann & Henderson 
over two datasets in the free-viewing task. We found (results in 
Section III) that both fixation density and bottom-up saliency 
are high at the center of objects, making saliency a potential 
confounding factor for the object-center hypothesis. To address 
this confound, we then proposed a combined model of saliency 
and object center-bias that outperforms each component 
significantly. This supports the object center-bias hypothesis and 
indicates that both saliency and object information contribute 
to gaze guidance in scene viewing. Although saliency 
and object center-bias correlate with each other, neither is 
a subset of the other, which is why their combination 
performs better than each cue individually. We also noticed 
that this finding is consistent whether using bounding boxes 
or polygons, and across different saliency models and weighting 
approaches. Overall, our results support those of recent works 
showing that object center-bias improves fixation prediction 
(e.g., Xu et al. and Stoll et al. [64]), which further supports the 
hypothesis that fixations are driven by objects as well as early 
saliency. 

Footnote (UCSB dataset): https://labs.psych.ucsb.edu/eckstein/miguel/research_pages/saliencydata.html 

Fig. 11: NSS score with different types of object center-bias emphasis over (a) our data and (b) the OSIE dataset. Results do 
not show a big difference in performance or in optimal β. It seems that linear weighting is the best strategy over both datasets. 

We hope that our work will open new directions to understand 
strategies that humans use in object and scene observation 
and will help construct more predictive saliency models 
in the future. 